1 Data Overview
1.1 Source
The data we will be analysing is a sample containing answers to questions asked in an online survey by DATA2X02 students. This data was collected through an online form, whose link was given to all students of DATA2X02. The survey contained many questions of various types including multiple-choice, free-form, etc. Let’s begin with reading the dataset and have a look at it using dplr library (A part of tidyverse package).
#Importing some important libraries
library(tidyverse)
library(plotly)
library(janitor)
library(knitr)
library(forcats)
library(ICSNP)
library(visdat)
library(kableExtra)
set.seed(2019)
survey = read_csv("https://docs.google.com/spreadsheets/d/1hNeBWmVXTLyUwl6b9yTnfPn2jfbQuBdPJuhJu-FZJSo/export?gid=690048681&format=csv")
glimpse(survey)## Observations: 110
## Variables: 22
## $ Timestamp <chr> ...
## $ Identifier <chr> ...
## $ Gender <chr> ...
## $ `Postcode of where you live during semester` <dbl> ...
## $ `What statistics courses have you taken?` <chr> ...
## $ `How many university clubs are you a member of?` <dbl> ...
## $ `How long has it been since you last went to the dentist?` <chr> ...
## $ `On average, how many hours per week did you spend on university work last semester?` <dbl> ...
## $ `What is your favourite social media platform?` <chr> ...
## $ `How many siblings do you have?` <dbl> ...
## $ `Did you have a pet growing up?` <chr> ...
## $ `Do you currently live with your parents?` <chr> ...
## $ `How many hours a week do you spend exercising?` <chr> ...
## $ `What is your eye colour?` <chr> ...
## $ `How many hours per week do you work in paid employment?` <dbl> ...
## $ `What is your favourite season of the year?` <chr> ...
## $ `What is your shoe size?` <dbl> ...
## $ `How tall are you?` <dbl> ...
## $ `How often do you floss your teeth?` <chr> ...
## $ `Do you wear glasses or contacts?` <chr> ...
## $ `What is your dominant hand?` <chr> ...
## $ `How do you like your steak cooked?` <chr> ...
1.2 Variables and Data Cleaning
As we can see, the names of columns are unnecessarily long and not R-friendly. We will use janitor library’s clean_names() and rename() functions to change column names.
survey = survey %>%
clean_names() %>%
rename(
postcode = postcode_of_where_you_live_during_semester,
units = what_statistics_courses_have_you_taken,
clubs = how_many_university_clubs_are_you_a_member_of,
dentist = how_long_has_it_been_since_you_last_went_to_the_dentist,
study = on_average_how_many_hours_per_week_did_you_spend_on_university_work_last_semester,
social_media = what_is_your_favourite_social_media_platform,
siblings = how_many_siblings_do_you_have,
exercise = how_many_hours_a_week_do_you_spend_exercising,
pet_growing_up = did_you_have_a_pet_growing_up ,
live_with_parents = do_you_currently_live_with_your_parents,
eye_colour = what_is_your_eye_colour,
hrs_employed = how_many_hours_per_week_do_you_work_in_paid_employment,
fav_season = what_is_your_favourite_season_of_the_year,
shoe_size = what_is_your_shoe_size,
height = how_tall_are_you,
floss_frequency = how_often_do_you_floss_your_teeth,
glasses = do_you_wear_glasses_or_contacts,
handedness = what_is_your_dominant_hand,
doneness = how_do_you_like_your_steak_cooked
)We can also see that many variable types are not properly accessed by R. For example variable exercise is accessed as chr but it is actually a numeric type variable.mutate function is used to change variable types:
postcodeto factor typeclubsto integer typedentistto factor typesiblingsto integer typepet_growing_upto factor typelive_with_parentsto factor typeexerciseto numeric typefav_seasonto factor typefloss_frequencyto factorglassesto factor typehandednessto factor typedonenessto factor type
survey = survey %>%
mutate(
postcode = as.factor(postcode),
clubs = as.integer(clubs),
dentist = as.factor(dentist),
siblings = as.integer(siblings),
pet_growing_up = as.factor(pet_growing_up),
live_with_parents = as.factor(live_with_parents),
exercise = as.numeric(exercise),
fav_season = as.factor(fav_season),
floss_frequency = as.factor(floss_frequency),
glasses = as.factor(glasses),
handedness = as.factor(handedness),
doneness = as.factor(doneness)
)This data needed a lot of cleaning, we will be cleaning specific variables when we analyse them. NA values are not prominent as we can see below (using vis_miss() function):
Most NULL values are present in the identifier(13.64%) and postcode(3.64%) column. Students might have left postcode variable as NULL as they might not be comfortable sharing their location in the survey.
We will handle the NULL values in respective columns when need be.
2 Data Analysis
2.1 Is this sample random?
This sample is not random as:
The students who are not very attentive about the UOS probably won’t know about the survey. Thus, this data doesn’t represent all the students but only those who are attentive in class.
Some entries are filled unrealistically making the sample less represntative of the population.
As Ed platform is not compulsory to be active on, some students might have missed the announcement and hence, not filled the form
2.2 Bias
Selection bias: The students filling the form might be only those who are regular with the unit.
- Measurement bias:
- Recall bias: A person might not remeber stuff like height and shoe size.
- Sensitive questions: Students might not answer or answer dishonestly to the questions they find unfavourable,for example, they might be uncomfortable to share their postcode and the number of hours they study .
Sampling method: Some students might not check the announcements on Ed platform regularly making them unlikely to fill the survey.
2.3 Variables with bias
Study: Only the regular students are probable to fill the survey thus making the sample less representative of the actual population. Also, some students might not have answered the question honestly due to reasons like peer pressure, insecurity, etc…
Clubs: Some students who are highly active in many clubs might not be able to attend classes regularly.
Hours Employed: Some students who work many hours might not be able to maintain the regularity with the Unit affecting their ability to fill the survey on time.
Height and Shoe size: Student might not remember their exact height and would have filled an approximated guess.
Postcode: Students might feel insecure in sharing their location.
2.4 Exercise done by vegetarians and non-vegetarian
Having a look at the distribution of the doneness column and representing it clearly using kable() function:
survey %>%
tabyl(doneness) %>%
adorn_pct_formatting() %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)| doneness | n | percent | valid_percent |
|---|---|---|---|
| Blue | 1 | 0.9% | 0.9% |
| I don’t eat red meat | 7 | 6.4% | 6.4% |
| Medium | 20 | 18.2% | 18.3% |
| Medium-rare | 42 | 38.2% | 38.5% |
| Medium-well done | 23 | 20.9% | 21.1% |
| Rare | 4 | 3.6% | 3.7% |
| Well done | 12 | 10.9% | 11.0% |
| NA | 1 | 0.9% |
|
We have to categorize all the options under Vegetarian or Non-Vegetarian categories respectively to analyse the data (using mutate() function).
diet_ex = survey %>%
drop_na(doneness) %>%
mutate(
diet = case_when(
doneness == "I don't eat red meat"~"Veg",
T~"Non_Veg"
)
) %>%
drop_na(exercise) %>%
select(diet, exercise)Having a look at the exercise column using kable() function:
diet_ex %>%
summary(exercise) %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)| diet | exercise | |
|---|---|---|
| Length:109 | Min. :0.000e+00 | |
| Class :character | 1st Qu.:1.000e+00 | |
| Mode :character | Median :3.000e+00 | |
| NA | Mean :8.462e+16 | |
| NA | 3rd Qu.:6.000e+00 | |
| NA | Max. :9.223e+18 |
Maximum value seems to be quite unrealistic. So, we have to make some changes to this column to remove unrealistic outliers (for example, exercise more than 24*7 hours a week) which might affect our analysis (using mutate function).
diet_ex = diet_ex %>%
mutate(
diet = as.factor(diet),
exercise = case_when(
exercise > 24*7 ~ NA_real_,
T ~ exercise
)
) %>%
drop_na(exercise)
# Also removing unnecessary outliers out of the main dataset
survey = survey %>%
mutate(
exercise = case_when(
exercise > 24*7 ~ NA_real_,
T ~ exercise
)
) %>%
drop_na(exercise)Now, we have our cleaned data ready to be analysed. As we have to check if there is a significant difference in exercising habits of people who are vegetarian and non-vegetarian, let’s first check the normality of these two sets by making boxplots and qqplots using ggplot2 and plotly library.
box_diet_ex = diet_ex %>%
ggplot(aes(x = diet, y = exercise)) +
geom_boxplot() +
geom_point()
qq_diet_ex = diet_ex %>% ggplot(aes(sample = exercise)) +
geom_qq() +
geom_qq_line() +
facet_wrap(~diet)
subplot(ggplotly(box_diet_ex), ggplotly(qq_diet_ex))The box plot for two groups shows that they are not quite symmetric which does not support the normality assumption. There is also an outlier in the non-vegetarian group which further makes the normality assumption worse.
On the right, we can see the qq-plot of the two groups. For the non-vegetarian people the extremes are pretty substantial, which is quite against the normality assumption. Vegetarian people’s qq points seem to roughly lie on the qq line but it might not be correct to assume normality because of less number of people on the Vegetarian side.
As we can see from the above qq plot and the below given density plot, we can assume that they roughly follow the same distribution. The dots in the qq-plot also seem to be in similar shape.
dense_diet_ex = diet_ex %>%
ggplot(aes(x = exercise, fill = diet)) +
geom_density(alpha = 0.5)
ggplotly(dense_diet_ex)So, the conclusion is that we can not assume normality for this sample. Further, we can roughly assume that they come from the same distribution. Therefore, we can use wilcoxon rank-sum test as it drops the normality assumption.
Specifying the hypothesis properly:
Hypothesis: \(H_{0}: \mu_{x} = \mu_{y}\) vs \(H_{1}: \mu_{x} \not= \mu_{y}\), where \(\mu_{x}\) is the mean of the vegetarian population’s (All students of the class) exercising hours and \(\mu_{y}\) is the mean of the non-vegetarian population’s exercising hours.
Assumption: \(X_{i}\) and \(Y_{i}\) are independent and follow the same distribution but differ by a shift, where \(X_{i}\) are all the observations of the exercise done by vegetarian students in the sample and \(Y_{i}\) are that of non-vegetarian students, \(i \in 1,2,\dots,107\)
As all the students filled their answers independently, all observations are independent. Also, from the above discussion and the density plot, we can safely assume that they follow the same distribution possibly with a shift.
A short summary of each group’s exercising hours using summarise() function:
diet_ex %>%
group_by(diet) %>%
summarize(
n = n(),
mean = mean(exercise),
sd = sd(exercise)
) %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)| diet | n | mean | sd |
|---|---|---|---|
| Non_Veg | 100 | 4.0125 | 3.866192 |
| Veg | 7 | 5.0000 | 4.690416 |
- Test Statistic: \(W = R_{1} + R_{2}+\dots+R_{n_{x}}\). Under \(H_{0}\), \(W\) follows the \(WRS(n_{x}, n_{y})\) distribution, where \(n_{x} = 100\), \(n_{y}=7\) and \(R_{i}\) are the ranks of the first group(\(X_{i}\)).
#Calculating test statistic
w = diet_ex %>%
arrange(diet) %>%
mutate(
rank = rank(exercise)
) %>%
slice(1:100) %>%
select(rank) %>%
sum()
w## [1] 5354.5
- Observed Test Statistic: \(t_{0} =\) 5354.5
##
## Wilcoxon rank sum test
##
## data: exercise by diet
## W = 304.5, p-value = 0.5638
## alternative hypothesis: true location shift is not equal to 0
- P-value: Also, we know that if the test statistic
wis less than its minimum value i.e. \(n_{x}(n_{x} + 1)/2\), then the p-value is \(2P(W\leq w)\).
w = 5354.5 and \(n_{x}(n_{x} + 1)/2\) = 5050
So,
\(2P(W\leq w) =\) 0.563797
- Decision: As the p-value 0.563797 is quite greater than 0.05, we don’t have enough evidence to reject \(H_{0}\). So, the data is consistent with the null hypothesis that there is not a significant difference between the exercising habits of vegetarian and non-vegetarian students.
2.5 Exercise and participation in clubs
Having a look at the distribution of the clubs column using tabyl() function:
survey %>%
tabyl(clubs) %>%
adorn_pct_formatting() %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)| clubs | n | percent |
|---|---|---|
| 0 | 41 | 38.3% |
| 1 | 22 | 20.6% |
| 2 | 16 | 15.0% |
| 3 | 18 | 16.8% |
| 4 | 3 | 2.8% |
| 5 | 5 | 4.7% |
| 7 | 1 | 0.9% |
| 10 | 1 | 0.9% |
We have to categorize all the options as either Greater than 3 or Less than or equal to 3 to analyse the data.
club_ex = survey %>%
drop_na(clubs, exercise) %>%
mutate(
clubs = case_when(
clubs > 3 ~ "Greater than 3",
clubs <=3 ~ "Less than or equal to 3"
)
) %>%
select(clubs, exercise)This time, we don’t need to clean the exercise column as we already did that in previous analysis.
Now, we have our cleaned data ready to be analysed. As we have to check if there is a significant difference in exercising habits of people who involve in more than 3 clubs and who don’t, let’s first check the normality of these two sets using graphical summaries plotted with ggplot2 and plotly library.
box_club_ex = club_ex %>%
ggplot(aes(x = clubs, y = exercise)) +
geom_boxplot() +
geom_point()
qq_club_ex = club_ex %>% ggplot(aes(sample = exercise)) +
geom_qq() +
geom_qq_line() +
facet_wrap(~clubs)
subplot(ggplotly(box_club_ex), ggplotly(qq_club_ex))The box plot for two groups shows that they are not quite symmetric which does not support the normality assumption. There is also an outlier in the Less than or equal to 3 group which further makes the normality assumption worse.
On the right, we can see the qq-plot of the two groups. For the Less than or equal to 3 group the extremes are pretty substantial, which is quite against the normality assumption. Greater than 3 group’s qq-points seem to roughly lie on the qq-line but it might not be correct to assume normality because of less number of people on Greater than 3 side of the sample.
dens_clubs_ex = club_ex %>%
ggplot(aes(x = exercise, fill = clubs)) +
geom_density(alpha = 0.5) +
theme_classic()
ggplotly(dens_clubs_ex)As we can see from the above qq plot and density plot, we can assume that they roughly follow the same distribution. The dots in the qq-plot also seem to be in similar shape. They do not seem to follow exactly the same distribution, which can be a limitation on using this test.
So, the conclusion is that we can not assume normality of this sample. Further, we can roughly assume that they come from same distribution. Therefore, we can use wilcoxon rank-sum test as it drops the normality assumption.
Specifying the hypothesis properly:
Hypothesis: \(H_{0}: \mu_{x} = \mu_{y}\) vs \(H_{1}: \mu_{x} \not= \mu_{y}\), where \(\mu_{x}\) is the mean of the
Greater than 3group’s (All students of the class) exercising hours and \(\mu_{y}\) is the mean of theLess than or equal to 3group’s exercising hours.Assumption: \(X_{i}\) and \(Y_{i}\) are independent and follow the same distribution but differ by a shift, where \(X_{i}\) are all the observations of the exercise done by students who involve in more than 3 clubs, in the sample and \(Y_{i}\) are that of students involving in less than or equal to 3 clubs, \(i \in 1,2,\dots,107\)
As all the students filled their answers independently, all observations are independent. Also, from the above discussion and the density plot, we can roughly assume that they follow the same distribution possibly with a shift. This can be a limitation of using this test.
A short summary of each group’s exercising hours using summarize() function:
club_ex %>%
group_by(clubs) %>%
summarize(
n = n(),
mean = round(mean(exercise), 2),
sd = round(sd(exercise), 2)
) %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)| clubs | n | mean | sd |
|---|---|---|---|
| Greater than 3 | 10 | 2.30 | 2.06 |
| Less than or equal to 3 | 97 | 4.26 | 4.01 |
- Test Statistic: \(W = R_{1} + R_{2}+\dots+R_{n_{x}}\). Under \(H_{0}\), \(W\) follows the \(WRS(n_{x}, n_{y})\) distribution, where \(n_{x} = 10\), \(n_{y}=97\) and \(R_{i}\) are the ranks of the first group(\(X_{i}\)).
#Calculating test statistic
w = club_ex %>%
arrange(clubs) %>%
mutate(
rank = rank(exercise)
) %>%
slice(1:10) %>%
select(rank) %>%
sum()
w## [1] 401.5
- Observed Test Statistic: \(t_{0} =\) 401.5
##
## Wilcoxon rank sum test
##
## data: exercise by clubs
## W = 346.5, p-value = 0.1355
## alternative hypothesis: true location shift is not equal to 0
- P-value: Also, we know that if the test statistic
wis greater than its minimum value i.e. \(n_{x}(n_{x} + 1)/2\), then the p-value is \(2P(W\geq w)\).
w = 401.5 and \(n_{x}(n_{x} + 1)/2\) = 55
So,
\(2P(W\geq w) =\) 0.1355486
- Decision: As the p-value 0.1355486 is greater than 0.05, we don’t have enough evidence to reject \(H_{0}\). So, the data is consistent with the null hypothesis that there is not a significant difference between the exercising habits of students who involve in more than 3 clubs and those who don’t.
2.6 Living with parents vs Studying hours
Having a look at the distribution of the study column:
survey %>%
tabyl(study) %>%
adorn_pct_formatting() %>%
filter(study > 100) %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)| study | n | percent | valid_percent |
|---|---|---|---|
| 167 | 1 | 0.9% | 0.9% |
| 312 | 1 | 0.9% | 0.9% |
As we can see, there are some unrealistically large values present in the data which will interfere with our analysis in the form of unrealistic outliers. Taking care of these values as follows:
Here, I have assigned all the values greater than 18*7 as unrealistic. I have removed on-average 6 hours per day for sleeping and any other activity, as it might add extra hours of study for humans.
survey = survey %>%
mutate(
study = case_when(
study > 18*7 ~ NA_real_,
study < 0 ~ NA_real_,
T ~ as.numeric(study)
)
)
study_parent = survey %>%
drop_na(live_with_parents, study) %>%
select(live_with_parents, study)Now, we have our cleaned data ready to be analysed. As we have to check if there is a significant difference in studying habits of people who live with parents and who don’t, let’s first check the normality of these two sets.
box_study_parent = study_parent %>%
ggplot(aes(x = live_with_parents, y = study)) +
geom_boxplot() +
geom_point()
qq_study_parent = study_parent %>% ggplot(aes(sample = study)) +
geom_qq() +
geom_qq_line() +
facet_wrap(~live_with_parents)
subplot(ggplotly(box_study_parent), ggplotly(qq_study_parent))The box plot for two groups shows that they are roughly symmetric which supports the normality assumption. But, there is an outlier in the Yes group which may result in it’s rejection.
On the right side, we can see the qq-plot of the two groups. For the No group the extremes are a bit further from the theoritical quantiles, which does not support the normality assumption. Yes group’s qq points seem to roughly lie on the qq line except one outlier but it can be roughlt assumed that they follow normality because of large number of observations on Yes group.
As we can see from the above qq plot and the below given density plot, we can assume that they roughly follow the normal distribution.
dens_study_parent = study_parent %>%
ggplot(aes(x = study, fill = live_with_parents)) +
geom_density(alpha = 0.5)
ggplotly(dens_study_parent)Both plots, roughly seem to follow normal distribution, but the No group seem to have bimodality, which might be a limitation while using t-test. They don’t seem to have similar distribution which does not support the use of wilcoxon sign rank test.
So, here we will use permutation test using t-statistic
Specifying the hypothesis properly:
Hypothesis: \(H_{0}: \mu_{x} = \mu_{y}\) vs \(H_{1}: \mu_{x} \not= \mu_{y}\), where \(\mu_{x}\) is the mean of the group of students not living with their parents’ study hours and \(\mu_{y}\) is the mean of the study hours of students living with parents.
Assumption: \(X_{i}\) and \(Y_{i}\) are independent and exchangeable, where \(X_{i}\) are all the observations of the study hours of students not living with parents in the sample and \(Y_{i}\) are that of students living with parents, \(i \in 1,2,\dots,52\)
As all the students filled their answers independently, all observations are independent. Also, as they don’t have any relation, their groups can be exchanged.
A short summary of each group’s exercising hours:
study_parent %>%
group_by(live_with_parents) %>%
summarize(
n = n(),
mean = round(mean(study),2),
sd = round(sd(study),2)
) %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)| live_with_parents | n | mean | sd |
|---|---|---|---|
| No | 57 | 27.70 | 14.02 |
| Yes | 47 | 29.04 | 17.13 |
- Test Statistic: \(T = (\overline X - \overline Y)/(s\sqrt n)\). Under \(H_{0}\), not necessarily following \(t_{n-1}\) distribution.
- Observed Test Statistic: \(t_{0} =\) -0.4306861
#Calculating p-value
B = 10000 # number of permuted samples we will consider
permuted_study_parent = study_parent # make a copy of the data
t_null = vector("numeric", B) # initialise outside loop
for(i in 1:B) {
permuted_study_parent$live_with_parents = sample(study_parent$live_with_parents) # this does the permutation
t_null[i] = t.test(study ~ live_with_parents, data = permuted_study_parent)$statistic
}
p_val = 2*mean(t_null < t0)
p_val## [1] 0.662
- P-value: It will be the probability of getting the observed test statistic less than the array of test statistics we got from permutation.
$p~value = 2P(t_null > t_{0}) = $ 0.662
- Decision: As the p-value 0.662 is quite greater than 0.05, we don’t have enough evidence to reject \(H_{0}\). So, the data is consistent with the null hypothesis that there is not a significant difference between the studying habit between students who live with their parents and those who don’t.
Let’s check the result of the permutation t-test with our normal Welch two sample t-test. It might be valid as from the previous discussion we can say that our sample roughly follows normal distribution.
##
## Welch Two Sample t-test
##
## data: study by live_with_parents
## t = -0.43069, df = 88.615, p-value = 0.6677
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.526971 4.845373
## sample estimates:
## mean in group No mean in group Yes
## 27.70175 29.04255
As we can see, that Welch-two sample test gives us the p_value of 0.6677424, which is quite similar to what we got from the permutation test. This further solidifies our decision from the permutation test that, we don’t have enough evidence to reject the null hypothesis.
2.7 Growing up with pet and Living with parents
Here, we will try to find any relation between currently living with parents and growing up with a pet.
Let’s have a look at the contingency table:
parent_pet = table(survey$live_with_parents, survey$pet_growing_up) %>%
as.data.frame() %>%
`colnames<-`(c("live_with_parents", "pet_growing_up", "Freq"))
parent_pet## live_with_parents pet_growing_up Freq
## 1 No No 34
## 2 Yes No 13
## 3 No Yes 25
## 4 Yes Yes 35
Visualizing our table using ggplot2 and plotly library:
parent_pet_bar1 = parent_pet %>%
ggplot(aes(x = live_with_parents, y = Freq, fill = pet_growing_up)) +
geom_bar(stat = "identity")
parent_pet_bar2 = parent_pet %>%
ggplot(aes(x = live_with_parents, y = Freq, fill = pet_growing_up)) +
geom_bar(stat = "identity", position = "fill")
subplot(ggplotly(parent_pet_bar1), ggplotly(parent_pet_bar2))Just a rough look suggests a relation between these two variables, but we have to prove it statistically. Here, we will use ‘Test for Independence’ to check the independence between them.
Hypothesis: \(H_{0}: p_{ij} = p_{∙j}p_{i∙}~, i=1,2;j=1,2,3,4\) vs \(H_{1}:\)Not all equalities hold
Assumption: \(e_{i} = y_{∙j}y_{i∙}/n\ge 5\).
Making a matrix with same values for easier and effective calculations:
y = c(35, 25, 13, 34)
n = sum(y)
c = 2
r = 2
y.mat = matrix(y, nrow = r, ncol = c, byrow = F)
colnames(y.mat) = c("Pet_growing_up", "No_pet_growing_up")
rownames(y.mat) = c("Living_with_parents","Not_living_with_parents")Checking assumptions:
yr = apply(y.mat, 1, sum)
yc = apply(y.mat, 2, sum)
yr.mat = matrix(yr, r, c, byrow = FALSE)
yc.mat = matrix(yc, r, c, byrow = TRUE)
ey.mat = (yr.mat * yc.mat)/sum(y.mat) #Matrix multiplication to get expected values of each cell in y.mat
all(ey.mat >= 5)## [1] TRUE
The assumption seems to be true as all the expected values are greater than 5.
- Test Statistic: \(T = \sum_{i=1}^{r}\sum_{j=1}^c\frac{(y_{ij} - e_{ij})^2}{e_{ij}}\) Under \(H_{0}\), \(T\sim\chi_{(2-1)(2-1)}^2\) approx. The degree of freedom was calculated with the formula \(df = (r-1)(c-1)\), where r is number of rows and c is number of columns.
Using chisq.test function for the Pearson’s Chi-squared test:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: y.mat
## X-squared = 8.823, df = 1, p-value = 0.002974
Observed Test Statistic: \(t_{0} = 8.823\)
P-value: \(P(\chi_{1}^2 \geq 8.823) = 0.002974\)
Decision: Since the p-value (0.002974) is a quite smaller than 0.05, it is concluded that there is enough evidence to reject the null hypothesis and this data set is not consistent with the null hypothesis stating that living with parents and growing up with parents are independent of each other.
3 DATA2902 Question
3.1 Hotelling-T test
Let’s assume that another australian university claims to have the means of study hours, employment hours and height to be 20 hours, 10 hours and 180 cm respectively. To check if their means are the same as ours, let us perform a multivariate mean test using Hotelling’s T-test.
Cleaning the employment hours, studying hours and height column i.e. removing some unnecessary outliers. Again considering only 18 hours out of 24 hours as productive, removing sleep and other activities (using mutate function).
survey = survey %>%
mutate(
hrs_employed = case_when(
hrs_employed > 18*7 ~ NA_real_,
hrs_employed < 0 ~ NA_real_,
T ~ hrs_employed
)
)
survey = survey %>%
mutate(
study = case_when(
study > 18*7 ~ NA_real_,
study < 0 ~ NA_real_,
T ~ as.numeric(study)
)
)
survey = survey %>%
mutate(
height = case_when(
height > 230 ~ NA_real_,
height < 1.1 ~ NA_real_,
height < 2.30 ~ height*100,
T ~ as.numeric(height)
)
)Considering visual aid (Boxplot and QQplot) for normality assumption using ggplot2 and plotly libraries:
- Studying hours
hotelling_sample = survey %>%
drop_na(study, hrs_employed, height) %>%
select(study, hrs_employed, height)
study_plot1 = hotelling_sample %>%
ggplot(aes(x = "Study hours", y = study)) +
geom_boxplot() +
geom_jitter(width = 0.2, size = 1.5, color = "blue")
study_plot2 = hotelling_sample %>%
ggplot(aes(sample = study)) +
geom_qq() +
geom_qq_line()
subplot(study_plot1, study_plot2)- Employment hours:
hr_emp_plot1 = hotelling_sample %>%
ggplot(aes(x = "Hours Emploted", y = hrs_employed)) +
geom_boxplot() +
geom_jitter(width = 0.2, size = 1.5, color = "blue")
hr_emp_plot2 = hotelling_sample %>%
ggplot(aes(sample = hrs_employed)) +
geom_qq() +
geom_qq_line()
subplot(hr_emp_plot1, hr_emp_plot2)- Height:
height_plot1 = hotelling_sample %>%
ggplot(aes(x = "Height", y = height)) +
geom_boxplot() +
geom_jitter(width = 0.2, size = 1.5, color = "blue")
height_plot2 = hotelling_sample %>%
ggplot(aes(sample = height)) +
geom_qq() +
geom_qq_line()
subplot(height_plot1, height_plot2)As, we can see from above, Study hours has 2 outliers and a roughly symmetric boxplot. On extremes, there is a deviation from the theorotical quantiles. In the employment hours boxplot, there is limited symmetry, it’s qq-plot also shows huge variation. Height plots show much better symmetry and less variation in qq-plot. So, assuming the normality of these sets, might show deviation from actual results. This can be a huge limitation for hotelling’s t test as it assumes multivariate normality.
Specifying the hypothesis properly:
Hypothesis: \(H_{0}: \boldsymbol \mu~=\boldsymbol \mu_{0}\) vs \(H_{1}: \boldsymbol \mu\not=\boldsymbol \mu_{0}\), where \(\boldsymbol \mu\) is the array containing mean values from our sample and \(\boldsymbol \mu_{0}\) is the array containing mean values from the other university.
Assumption: \(\boldsymbol X_{1},\) \(\boldsymbol X_{2},\) \(\dots,\) \(\boldsymbol X_{p}\) are independent and follow multivariate normal distribution \(N_{p}(\boldsymbol \mu, \boldsymbol \sum)\) where \(\boldsymbol X_{i}\) are all the observations from our sample and \(p\) is the number of constraints checking simultaneously, valued at 3.
As all the students filled their answers independently, all observations are independent. Also, we are assuming that they follow multivariate normal distribution as discussed above.
- Test Statistic: \(T^2=\) \(n\) \((\) \(\overline X - \boldsymbol \mu_{0}\) \()^T\) \(\boldsymbol S^-1\) \((\) \(\overline X\) \(-\) \(\boldsymbol \mu_{0}\) \()\). Under \(H_{0}\), \(T^2 \tilde~F_{p, n-p}\) as n is large.
#Calculating test statistic
v = var(hotelling_sample)
bmean = apply(hotelling_sample, 2, mean)
mu0 = c(20, 10, 180)
n = nrow(hotelling_sample)
t = n*t(bmean - mu0) %*% solve(v) %*% (bmean - mu0)- Observed Test Statistic: \(t_{0} =\) 83.6845157
##
## Hotelling's one sample T2-test
##
## data: hotelling_sample
## T.2 = 27.342, df1 = 3, df2 = 99, p-value = 5.779e-13
## alternative hypothesis: true location is not equal to c(20,10,180)
P-value: \(p~value = P(F_{3,99} \geq t_{0}) =\) 5.778710810^{-13}
Decision: As the p-value (<0.0001) is quite smaller than 0.05, we have enough evidence to reject \(H_{0}\). So, the data is not consistent with the null hypothesis that the mean of study hours, employment hours and height of other australian university is same as our class of DATA2X02.
4 Conclusion
Various biases discussed above like Measurement bias, Selection bias and Sampling method are the liimitations in this analysis as they prevent the sample to be a proper representative of the concerned population. Many tests performed above are based on assumptions and rough approximation making them prone to errors.
5 References
Create Elegant Data Visualisations Using the Grammar of Graphics Ggplot2.tidyverse.org. (2019). [online] Available at: https://ggplot2.tidyverse.org/
A Grammar of Data Manipulation Dplyr.tidyverse.org. (2019). [online] Available at: https://dplyr.tidyverse.org/
Larsen, R. J., & Marx, M. L. (2012). An introduction to mathematical statistics and its applications. Boston: Prentice Hall.
v1.1-1, I. (2019). ICSNP package | R Documentation. [online] Rdocumentation.org. Available at: https://www.rdocumentation.org/packages/ICSNP/versions/1.1-1
Plot.ly. (2019). plotly. [online] Available at: https://plot.ly/r/
v1.2.0, j. (2019). janitor package | R Documentation. [online] Rdocumentation.org. Available at: https://www.rdocumentation.org/packages/janitor/versions/1.2.0
v1.2.1, t. (2019). tidyverse package | R Documentation. [online] Rdocumentation.org. Available at: https://www.rdocumentation.org/packages/tidyverse/versions/1.2.1
v1.25, k. (2019). knitr package | R Documentation. [online] Rdocumentation.org. Available at: https://www.rdocumentation.org/packages/knitr/versions/1.25
v0.3.0, f. (2019). forcats package | R Documentation. [online] Rdocumentation.org. Available at: https://www.rdocumentation.org/packages/forcats/versions/0.3.0
v0.0.4.9500, v. (2019). visdat package | R Documentation. [online] Rdocumentation.org. Available at: https://www.rdocumentation.org/packages/visdat/versions/0.0.4.9500